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We have exactly enumerated all sequences and conformations of HP proteins with chains of up to 
19 monomers on the simple cubic lattice. For two variants of the hydrophobic-polar (HP) model, 
where only two types of monomers are distinguished, we determined and statistically analyzed 
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order to be able to perform these exact studies, we applied an efficient enumeration method based 
on contact sets. 
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I. INTRODUCTION 

Real proteins are build up of sequences of amino acids 
covalently linked by peptide bonds. Twenty different 
types of amino acids occurring in protein sequences are 
known. For a protein consisting of N amino acid residues 
there are thus in principle 20^ possibilities to form se- 
quences or primary protein structures. Single domain 
polypeptides usually possess N = 30 . . . 400 residues; 
proteins built up of several domains can consist of up 
to 4000 amino acids. Only a few of the 20^ possible 
proteins, however, are actually realized in nature and are 
functional in a sense that they fulfil a specific task in 
a biological system. This requires the native structure 
of the protein to be unique and stable against moderate 
fluctuations of the environmental chemical and physical 
conditions. It is widely believed that the native state re- 
sides in a deep funnel-like minimum of the free energy 
landscape 0. Since the energy of a protein depends on 
its sequence, it seems plausible that only such sequences 
of residues are favored whose associated energy landscape 
shows up a pronounced global minimum. On the other 
hand, from the conformational point of view, it can be es- 
timated that the number of structures proteins typically 
fold into, is only of order 1000 - this is at least two orders 
of magnitude less than the number of known proteins. 

Hence, exposing the nature of the relationship between 
sequences (primary structure) and conformations (sec- 
ondary and higher structures) is one of the main aspects 
in protein research 0. Attacking this general problem 
by means of computer simulations based on realistic in- 
teractions is currently impossible. There are two ma- 
jor reasons being responsible for this. Firstly, the pre- 
cise form of the energy function in an all-atom approach 
containing all molecular and nuclear interactions within 
the polypeptide as well as the influence of the solvent is 
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still under consideration. An important question is what 
"level of detail" is necessary to model proteins in gen- 
eral. Considering an exemplified sequence of amino acid 
residues, the use of different force fields usually leads to 
different predictions of the native state. Secondly, even 
if a reliable model would exist, the sequence space is too 
large to be completely scanned by enumeration in or- 
der to search for the small number of sequences with 
appropriate free energy landscape (the number of pri- 
mary structures of very short peptides with, say only 10 
residues, is 20 10 » 10 13 ). 

In order to have a chance to perform such an analysis, 
the model must be drastically simplified. The simplest 
model to describe very qualitatively the folding behav- 
ior of proteins is the HP model Q , where the continuous 
conformational space is reduced to discrete regular lat- 
tices and conformations of proteins are modeled as self- 
avoiding walks restricted to the lattice. In this model it 
is assumed that the hydrophobic interaction is the essen- 
tial driving force towards a native fold. It is expected 
that the hydrophobic side chains are screened from the 
aqueous environment by hydrophilic residues. Therefore, 
the sequences of HP proteins consist of only two types 
of monomers (or classes of amino acids), amino acids 
with high hydrophobicity are treated as hydrophobic 
monomers (H), while the class of polar (or hydrophilic) 
residues is represented by polar monomers (P). In or- 
der to achieve the formation of a hydrophobic core sur- 
rounded by a shell of polar monomers, the interaction 
between hydrophobic monomers is attractive in the stan- 
dard formulation of the model. All other interactions are 
neglected. Variants of the HP model also take into ac- 
count (weaker) interactions between H and P monomers 
as well as between polar monomers Q. 

Although it is obvious that this model can describe 
the folding process very roughly only 0, 0, IE 01 : much 
work has been done to find lowest-energy states and their 
degeneracy for given sequences, or in the inverse prob- 
lem, to identify all sequences of given length whose na- 
tive conformation matches a given target structure. As 
simple as this model seems to be, it has been proven 
to be an NP-complete problem in two and three di- 
mensions 0- Therefore, sophisticated algorithms were 
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applied to find lowest energy states for chains of up 
to 136 monomers. The methods applied are based on 
very different algorithms, ranging: from exact enumer- 
ation in two dimensions 0, I and three dimensions 
on cuboid (compact) lattices 3. ITU, and hydropho- 
bic core construction methods 13, LUl over genetic al- 
gorithms EE E E El, Monte Carlo simula- 
tions with different types of move se ts I l9t 
and generalized ensemble approaches 23] to Rosenbluth 
chain growth methods |24[ of the 'Go with the Winners' 
type |H Elm Hall. With some of these algo- 
rithms, thermodynamic quantities of lattice heteropoly- 
mers were studied as well pi I27I I29I I30I l3l| . 

In this work, we apply an exact enumeration method 
to three-dimensional HP proteins being not necessarily 
compact on the simple cubic (s.c.) lattice. For efficiency, 
we enumerated contact sets for chains of given length in- 
stead of conformations. In order to study the interplay 
between sequences and conformations and to investigate 
peculiarities of designing sequences, we perform a sta- 
tistical analysis of the complete spaces of conformations 
and sequences for chains of up to N = 19 monomers. 

In Section [H] we give a review on the two variants of 
the HP model we use in our study, the original HP model 
and a variant taking into account an additional interac- 
tion between hydrophobic and polar residues. This is 
followed by Section IIIII where we discuss self-avoiding 
conformations and contact sets. Then, in Section llVl we 
perform an exact statistical analysis of properties of de- 
signing sequences and native conformations with lengths 
up to 19 monomers in comparison with the bulk of all 
possibilities to form sequences and to generate confor- 
mations, respectively. Since the exact data obtained with 
our algorithm can be rearranged in terms of the energy 
levels of the conformations, we are also able to deter- 
mine the densities of states for all sequences. This allows 
for the study of the energetic thermodynamic properties 
of sequences whose associated ground state is unique or 
not, and their comparison from a thermodynamic point 
of view. We do just that in Section[V] The paper is then 
concluded by summarising our results in Section IVT1 



II. HP MODELS 

A monomer of a HP sequence <x = (<7i, . . . , <7jv) is 
characterized by its residual type (<7j = P for polar and 
oi = H for hydrophobic residues), the place 1 < i < N 
within the chain of length N, and the spatial position x to 
be measured in units of the lattice spacing. A conforma- 
tion is then symbolized by the vector of the coordinates of 
successive monomers, X = (xi,X2, . .. ,Xjv)- We denote 
by Xij = |x,;— Xj I the distance between the ith and the jth 
monomer. The bond length between adjacent monomers 
in the chain is identical with the spacing of the used reg- 
ular lattice with coordination number k. These covalent 
bonds are thus not stretchable. A monomer and its non- 
bonded nearest neighbors may form contacts. Therefore, 



the maximum number of contacts of a monomer within 
the chain is (k — 2) and (k — 1) for the monomers at 
the ends of the chain. To account for the excluded vol- 
ume, lattice proteins are self- avoiding, i.e., two monomers 
cannot occupy the same lattice site. The general energy 
function of the non-covalent interactions reads in energy 
units e (we set e — 1 in the following) 



E = e 
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is a symmetric N x N matrix called contact map and 



U a . a , = 



UHH u H p 
UHP Upp 



(3) 



is the 2x2 interaction matrix. Its elements u aiaj cor- 
respond to the energy of HH, HP, and PP contacts. 
For labeling purposes we shall adopt the convention that 
ai = = P and a, = 1 = H . 

In the simplest formulation which we will refer 
to as HP model in the following, only the attractive 
hydrophobic interaction is nonzero, v^f H = — 1, while 
u hp = u hp = q Therefore, Uf*. = S atH S ajH . This 
model has been extensively used to identify ground states 
of HP sequences, some of which are believed to show up 
qualitative properties comparable with realistic proteins 
whose 20-letter sequence was translated into the 2-letter 
code of the HP model H [H Q |H |33|. As we will 
see later on, this simple form of the HP model suffers, 
however, from the fact that the lowest-energy states are 
usually highly degenerate and therefore the number of 
designing sequences (i.e., sequences with unique ground 
state - up to the usual translational, rotational, and re- 
flection symmetries) is very small, at least on the simple 
cubic lattice. 

For a more reliable statistical sequence analysis, we 
compare with another model of HP type, as proposed 
in Ref. 0- This model was motivated by results 
from an analysis of inter-residue contact energies be- 
tween real amino acids |34j . To this end, an attrac- 
tive nonzero energy contribution for contacts between H 
and P monomers is assumed 0. In what follows we 
call this the MHP (mixed HP) model. The elements of 
the interaction matrix © are chosen to be u^^ p = — 1, 
m mhp = _!/ 2 .3 pa -0.435, and uf^ p = 35]. The 
additional H-P interaction breaks conformational sym- 
metries yielding a much higher number of designing se- 
quences on cubic lattices. 



III. SELF-AVOIDING WALKS AND CONTACT 
MATRICES 

Since lattice polymers are self-avoiding walks, the total 
number of conformations for a chain with N monomers 
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FIG. 1: (a) Dependence of the numbers of self-avoiding 
walks C n and contact matrices M n on the number of steps 
n = TV — 1. (b) Ratios of numbers of self-avoiding walks 
r„ = Cn/Cn-i and contact matrices r* f = M n / M n -\. The 
dotted lines indicate the values the respective series converge 
to, 7-£, = [ic ~ 4.68 and = « 4.38, respectively. 



is not known exactly. For TV — > oo it is widely believed 
that in leading order the scaling law [3|| Wi\ 



(4) 



holds, where n is the number of self-avoiding steps (i.e., 
n — TV — 1). In this expression, \xc is the effective coor- 
dination number of the lattice, 7 is a universal exponent, 
and A is a non-universal amplitude. In Tabled we have 
listed the exactly enumerated number for self-avoiding 
conformations of chains with up to TV = n + 1 = 19 
monomers. Based on these data we estimated [ic ~ 4.684 
and 7 ~ 1.16 by extrapolating the results obtained with 
the ratio method j3|| |38| . These results are in good 
agreement with previous enumeration results |39l l40l l4l| . 
Monte Carlo methods and field-theoretic estimates 
for 7 01 . We should note that it is not the aim of this 
work to extend the numbers of walks Cjv in Table^which 
has already been enumerated up to n = 26 steps (and 
hence C27 ~ 5.49 x 10 17 self-avoiding conformations with 
TV = 27 monomers) 40]. Rather, we scan the combined 
space of HP sequences and conformations which contains 
for chains of TV = 19 monomers 2 19 Cig ~ 1.17 x 10 18 pos- 
sible combinations. Therefore, the computational efforts 
in our study are comparably demanding. 



TABLE I: Number of conformations Cjv (with a global sym- 
metry factor of 6 divided out) and contact matrices Mn for 
chains with TV monomers (or, equivalently, self-avoiding walks 
with n ■ 
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In models with the general form 0), where the calcu- 
lation of the energy reduces to the summation over con- 
tacts (i.e., pairs of monomers being nearest neighbors on 
the lattice but nonadjacent along the chain) of a given 
conformation, the number of conformations that must 
necessarily be enumerated can drastically be decreased 
by considering only classes of conformations, so-called 
contact sets [lfl |44|. A contact set is uniquely char- 
acterized by a corresponding contact map (or contact 
matrix), but a single conformation is not. Thus, for de- 
termining energetic quantities of different sequences, it is 
sufficient to carry out enumerations over contact sets. In 
a first step, however, the contact sets and their degener- 
acy, i.e., the number of conformations belonging to each 
set, must be determined and stored. Then, the loop over 
all non-redundant sequences is performed for all contact 
sets instead of conformations. The technical details of 
our implementation will be described elsewhere |45j . 

In Table |U the resulting numbers of contact sets Mm 
are summarized and, although also growing exponentially 
(see Figs.^a) and (b)), the gain of efficiency by enumer- 
ating contact sets, is documented by the ratio between 
Cjv and M/v in the last column. Assuming that the num- 
ber of contact sets M n follows a scaling law similar to 
Eq. Ijljl. we estimated the effective coordination number 
to be approximately fiM ~ 4.38. Unfortunately, the ra- 
tios of numbers of contact sets for even and odd numbers 
of walks oscillate much stronger than for the number of 
conformations, as is shown in Fig.^b). This renders an 
accurate scaling analysis (in particular for the exponent 
7) based on the data for the relatively small number of 
steps much more difficult than for self- avoiding walks. 
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TABLE II: Number of designing sequences Sn (only relevant 
sequences) in the HP and MHP model 



N 



4 5 6 7 



9 10 11 12 13 14 15 16 17 18 19 



30002000 2 011 1 8 29 47 
7 6 13 11 8 124 14 66 97 486 2196 9491 4885 



IV. EXACT STATISTICAL ANALYSIS OF 
DESIGNING SEQUENCES 

In this section, we analyze the complete sets Sjv of 
designing sequences for HP proteins of given numbers 
of residues N < 19. A sequence er is called designing, if 
there is only one conformation associated with the native 
ground state, not counting rotation, translation, and re- 
flection symmetries that altogether contribute on a sim- 
ple cubic lattice a symmetry factor 6 for linear, 24 for 
planar, and 48 for conformations spreading into all three 
spatial directions. In Table[H]we have listed the numbers 
of designing sequences Sn we found for the two models. 
In contrast to previous investigations of HP proteins on 
the square lattice 01 > the number of designing sequences 
obtained with the pure HP model is extremely small on 
the simple cubic lattice. This does not allow for a rea- 
sonable statistical study of general properties of designing 
sequences, at least for very short chains. The situation 
is much better using the more adequate MHP model. 
The first quantity under consideration is the hydropho- 
bicity of a sequence er, i.e., the number of hydrophobic 
monomers Nh, normalized with respect to the total num- 
ber of residues: 



i=l 



(5) 



The average hydrophobicity over a set of designing se- 
quences of given length TV is then defined by 



{m) N 



1 



>N 



E 



m(er). 



(6) 



In Fig. [21 we have plotted (m) jv as function of the se- 
quence length N. The plots do not show up a clear ten- 
dency to what average hydrophobicity they converge for 
long chains. This to know would be, however, of some in- 
terest for the design of a biased algorithm of Monte Carlo 
type that searches the combined sequence and conforma- 
tional space for candidates of designing sequences with 
lengths, where enumeration is no longer applicable. A 
distinct indication that designing sequences have in most 
cases hydrophobicities different from 0.5 could be used 
as a bias in order to reduce the section of the sequence 
space to be scanned, since the number of all possible se- 
quences with given hydrophobicity has a peak at m = 0.5 
(see Fig. |3fa)) which becomes the more pronounced the 
higher the number of residues is. 

It should be noted that the hydrophobicity distribu- 
tion for all these sequences is not binomial since in our 



(m) w 




FIG. 2: Dependence of average hydrophobicities (m)jv for 
designing sequences on the sequence length N in both mod- 
els. For several chain lengths, where no designing sequences 
exist (see Table ITTft . the calculation of the average hydropho- 
bicity © does not make sense. The dashed lines are therefore 
only guides to the eye. 



analysis we have distinguished only sequences that we 
call relevant, i.e., two sequences that are symmetric un- 
der reversal of their residues are identified and enter only 
once into the statistics. Therefore we consider, for exam- 
ple, only 10 relevant sequences with length N = 4 instead 
of 2 4 = 16. Taking into account all 2 N sequences would 
obviously lead to a binomial distribution for Nh, since 
there are then exactly 



N 



(7) 



sequences with Nh hydrophobic monomers. 

In Fig. Eta) we have plotted both, the distribution of 
hydrophobicity of the designing sequences with N = 18 
monomers in the MHP model and, for comparison, the 
distribution of all sequences with N = 18. For this ex- 
ample, we see that the width of the hydrophobicity dis- 
tribution for the designing sequences, which has its peak 
at (m)^ HP « 0.537 > 0.5, is smaller than that of the 
distribution over all sequences. In order to gain more 
insight how the hydrophobicity distributions differ, we 
have compared the widths of both distributions in their 
dependence on the chain length N < 19. This is shown 
in Fig. EJb) . It seems that for N — > oo the widths of the 
hydrophobicity distributions for the designing sequences 
asymptotically approach the curve of the widths of the 
hydrophobicity distributions of all sequences. 

The hydrophobicity profile 



1 



2Sn 



J2 fa + VN-i+i), i = l,2,...,N, (8) 



a-eSj, 



is a measure for the probability to find a hydrophobic 
monomer in a distance i from the nearest end of a de- 
signing sequence. Thus, this quantity gives an impres- 
sion of how the H monomers are on average distributed 
along the chain. In Figs. Ufa) and (b) the profiles for 
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FIG. 3: (a) Distribution of hydrophobicity ftjv of all design- 
ing sequences with N — 18 monomers (solid line) compared 
with the distribution of hydrophobicity of all sequences of this 
length (dashed line) for the MHP model, (b) Widths of the 
hydrophobicity distribution of the designing sequences, few, 
depending on the chain length N (solid line) compared with 
the widths of the hydrophobicity distribution of all sequences 
(dashed line) for the MHP model. 

designing sequences in the MHP model are plotted for 
respective chains with even (N — 14, 16, 18) and odd 
numbers (N = 15, 17, 19) of residues. As a first remark- 
able result, we see that for odd numbers of monomers 
the profile shows up periodic oscillations, i.e., if the nth 
monomer is preferably hydrophobic the (n + l)th residue 
is with lower probability. As this effect is stronger for 
N = 17 than for N = 19, we expect that the amplitude 
of these oscillations decreases with increasing number of 
monomers. The behavior of the chains with even number 
of monomers (Fig.QJa)) is less spectacular. Here, for in- 
creasing number of monomers, the probability seems to 
become more and more equally distributed. Therefore, 
it is more interesting to study how each monomer of the 
designing sequences is involved in the formation of HH 
contacts (as well as HP contacts in the MHP model) be- 
ing favored in low-energy conformations. We define the 
hydrophobic contact density profile by 

1 N 

^ = 2S~ ^ ^ [ Ci 3 ai<J 3 + C N-i+lj <TN-i+l<Jj] , (9) 
N o-GSjv 3 = 1 

where Cy is the contact map defined after Eq. QJ. The 



FIG. 4: Hydrophobicity profiles pi for designing sequences 
with (a) even and (b) odd numbers of monomers in the MHP 
model. Since the profile © is symmetric under i «-> N — i + 1 
we have only plotted it for half the chain. 



TABLE III: Number of designable conformations Dn in both 
models. 
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higher the affinity of the ith monomer to form contacts 
(preferably if it is hydrophobic), the bigger the value of qi. 
This profile is shown in Fig.[S]for both models, where we 
have again separated even and odd numbers of residues. 
From the two profiles for the HP model (N — 18, 19), we 
observe that there is a strong tendency of the monomers 
at the ends of the chain (i = 1, N) to form hydrophobic 
(HH) contacts. The reason is that these two monomers 
can have 5 nearest neighbors on the s.c. lattice, i.e., there 
is one more possibility for them to form a favorable en- 
ergetic contact than for monomers residing within the 
chain. In the MHP model, the behavior is less unique, 
since also HP contacts are attractive and the tendency 
that the ends are preferably hydrophobic is much weaker. 

After having discussed sequential properties of design- 
ing sequences, we now analyze the properties of their 
unique ground-state structures, the native conforma- 
tions. From Table ITTT1 we read off that the number of dif- 
ferent native conformations Dn is usually much smaller 
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FIG. 5: Profiles of the contact density qi for designing se- 
quences with (a) even and (b) odd numbers of monomers in 
the MHP model. For comparison, we have also inserted the 
profiles obtained in the HP model for N = 18 and N = 19, 
respectively. By definition this profile is also symmetric 
under i <-> N — i + 1. 



than the number of designing sequences, i.e., several de- 
signing sequences share the same ground-state conforma- 
tion. The number of designing sequences that fold into 
a certain given target conformation X^ ) (or conforma- 
tions being trivially symmetric to this by translations, 
rotations, and reflections) is called designability |4fij : 



E 



A 



(X gs (<x) 



X (o; 



(10) 



where X gs (<x) is the native (ground- state) conformation 
of the designing sequence cr. The function A(Z) is the 
generalization of Eq. (J2J to 37V-dimensional vectors. It 
is unity for Z = and zero otherwise. 

The designability is plotted in Fig.H3for all native con- 
formations that HP proteins with N = 17, 18, and 19 
monomers can form in the MHP model. In this figure, 
the abscissa is the rank of the conformations, ordered ac- 
cording to their designability. The conformation with the 
lowest rank is therefore the most designable structure and 
we see that a majority of the designing sequences folds 
into a few number of highly designable conformations, 
while only a small number of designing sequences pos- 
sesses a native conformation with low designability (note 
that the plot is logarithmic). Similar results were found, 
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FIG. 6: Designability Fn of native conformations in the MHP 
model for N = 17, 18, and 19. The abscissa is the rank 
obtained by ordering all designable conformations according 
to their designability. 



for example in Ref . |43 , where the designability of com- 
pact conformations on cuboid lattices was investigated 
in detail. The left picture in Fig. [7| shows the conforma- 
tion with the lowest rank (or highest designability) with 
N = 18 monomers. 

From our analysis we see that this characteristic dis- 
tribution of the designing sequences is not restricted to 
cuboid lattices only. This result is less trivial than one 
may think at first sight. As we will show later on in 
this paper in the discussion of the radius of gyration, 
native conformations are very compact, but only very 
few conformations are maximally compact (at least for 
N < 19). For longer sequences similar results were found 
in Ref. 30] . Highly designable conformations are of great 
interest, since it is expected that they form a frame mak- 
ing them stable against mutations and thermodynamic 
fluctuations. Such fundamental structures are also rele- 
vant in nature, where in particular secondary structures 
(helices, sheets, hairpins) supply proteins with a stable 
backbone |47j . 

Conformational properties of polymers are usually 
studied in terms of the squared end-to-end distance 



R, 



(xjv - xi ) 



(11) 





FIG. 7: Structure (N = 18) with the highest designability 
of all native conformations (left) and with minimal radius of 
gyration (right). 
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FIG. 8: (a) Average squared end-to-end distances (Re) of 
native conformations in the MHP model compared with those 
of all self-avoiding walks (SAW). We have also inserted the 
widths b R 2 of the corresponding distributions of end-to-end 
distances, (b) The same for the average squared radius of 
gyration (Rg). Since the radius of gyration is an appropriate 
measure for the compactness of a conformation, we have also 
plotted R? n . for the conformations with the minimal radius 
of gyration (or, equivalently, maximal compactness). 



and the squared radius of gyration 

N 



R 2 g Y 



(12) 



where x = J^. x.i/N is the center of mass of the polymer. 

In polymer physics both quantities are usually referred 
to as measures for the compactness of a conformation. In 
Fig. |Sf a) we compare the iV-dependence of the averages 
of the native conformations found in the MHP model and 
all possible self-avoiding walks. The same quantities for 
the squared radius of gyration are shown in Fig. OUb). 
The averages were obtained by calculating 



« 9 > SAW 



(13) 



xec N 



(R 



2 \MHP 
e,g/ 



s 



Rlj^sH), (14) 



( R e,g) ~ n with v ~ °- 59 ( see Ref - l48| for a re- 
cent summary of estimates for v) , the average end-to-end 
distance {R\ g ) MHP for the native conformations only is 
much smaller. For even number of monomers, the ends 
of a HP protein can form contacts with each other on 
the s.c. lattice. Accordingly, the values of (i?j 9 ) MHP are 
smaller for N being even and the even-odd oscillations 
are very pronounced. The widths (or standard devia- 
tions) bR2 of the distributions of the squared end-to-end 
distances are also very small. Even for heteropolymers 
with N = 19 monomers in total, there are virtually no na- 
tive conformations, where the distance between the ends 
is larger than 3 lattice sites. We have checked this for 
the HP model, too, and found the same effect. Since 
the number of native conformations is very small in this 
model, we have not included these results in the figure. 
Depicting the average squared radius of gyration (R 2 g ) 
and the widths of the corresponding distribution of the 
radius of gyration in Fig. OJb) for all self-avoiding con- 
formations as well as for the native ones, we see that 
these results confirm the above remarks. As the average 
end-to-end distances of native conformations are much 
smaller than those for the bulk of all conformations, we 
observe the same trend for the mean squared radii of 
gyration (R 2 a ) MBP and {R 2 ) SAW and the widths fe^ HP 

y y -tig 

and as well. In particular, the width &^2 HP is so 

small, that virtually all native conformations possess the 
same radius of gyration. For this reason, we have also 
searched for the conformations having the smallest ra- 
dius of gyration R 2 . (these conformations are not nec- 
essarily native as we will see!) and inserted these values 
into this figure, too. We observe that these values dif- 
fer only slightly from (R 2 ) Mlip . Thus we conclude that 
native conformations are very compact, but not neces- 
sarily maximally compact. This property has already 
been utilized in enumerations being performed a priori 
on compact lattices 0,0, E3i where, however, the pro- 
teins are confined by hand to live in small cuboids (e.g. of 
size 3x3x3 or 4x3x3). Our results on the general s.c. 
lattice confirm that this assumption is justified to a great 
extent. Nevertheless, the slight deviation from the mini- 
mal radius of gyration native conformations exhibit is a 
remarkable result as it concerns about 90% of the whole 
set of native conformations! This can be seen in Fig. [51 
where we have plotted the distribution of the squared 
radii of gyration for all self-avoiding conformations with 
N = 18 and the native states in the MHP model. All 
native conformations have a very small radius of gyra- 
tion but only a few of them share the smallest possible 
value. A structure with the smallest radius of gyration 
is shown on the right-hand side of Fig. [7| It obviously 
differs from the most-designable conformation drawn on 
the left of the same figure. 



where C^r is the set of all self-avoiding conformations 
on a s.c. lattice. Figure OJa) shows that, compared to 
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FIG. 9: Distribution h R 2 (normalized to h R 2 = 1) of 
squared radii of gyration (normalized with respect to the 
maximal radius of gyration iig m = (-/V 2 — 1)/12 of a com- 
pletely stretched conformation) of native conformations with 
N — 18 in the MHP model, compared with the histogram 
for all self-avoiding conformations. The vertical line refers to 
the minimal radius of gyration (Rl . IRi — 0.0579 for 
N = 18) and an associated structure is shown on the right- 
hand side of Fig. [7] The inset shows the distribution up to 
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= 0.5. 
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FIG. 10: Density of states g(E) for two designing sequences 
(d) with N = 18, m H = 8, and _E min = -9 in the HP model. 
We have divided out the symmetry factor 6 that is common 
to all conformations. Three-dimensional conformations have 
an additional symmetry factor 8, such that the states with 
minimal energy for these two curves are indeed unique and the 
sequences are designing. For comparison we have also plotted 
g(E) for one exemplified non-designing sequence (n) out of 
525 having the same properties as quoted above, but different 
sequences. The ground-state degeneracy for this example is 
go = g(E m i U ) — 6 x 1840 (including all symmetries). 



V. DENSITY OF STATES AND 
THERMODYNAMICS 



In this section we systematically compare ener- 
getic thermodynamic quantities of designing and non- 
designing sequences. In Ref. j4?| it was conjectured 
for exemplified sequences of comparable 14mers, one of 
them being designing, that designing sequences in the 
HP model seem to show up a much more pronounced 
low-temperature peak in the specific heat than the non- 
designing examples. This peak may be interpreted as 
kind of a conformational transition between structures 
with compact hydrophobic cores (ground states) and 
states where the whole conformation is highly compact 
(globules) [2^. l30l |. Another peak in the specific heat 
at higher temperatures, which is exhibited by all lattice 
proteins, is an indication for the usual globule - coil tran- 
sition between compact and untangled conformations. 

In order to study energetic thermodynamic quanti- 
ties such as mean energy and specific heat we deter- 
mined from our enumerated conformations for a given 
sequence the density of states g(E) that conveniently 
allows the calculation of the partition sum Z(T) — 
^2 E g(E) exp(— i^/fcsT) and the moments (E k )T — 
Y, E E k g(E) exp(-E/k B T)/Z, where the subscript T in- 
dicates the difference of calculating thermal mean val- 
ues based on the Boltzmann probability from averages 
previously introduced in this paper. Then the specific 
heat is given by the fluctuation formula CV = ((E 2 )t — 
(E) 2 T )/k B T 2 . " 



A. Sequences in the HP Model 

In the HP model with pure hydrophobic interaction 
the density of states shows up a monotonic growth with 
increasing energy, at least for the short chains in our 
study (for longer chains, e.g. the 42mer investigated in 
Ref. [2^, |3(j, the number of states in the high-energy 
region decreases, i.e., the density of states possesses a 
global maximum at an energy E residing within the inter- 
val -E m i rl < E < E max = 0). For a reasonable comparison 
of the behavior of designing and non-designing sequences, 
we have focused on 18mers having the same hydropho- 
bicity (rriH = 8) and ground-state energy E m - ln = —9. 
There are in total 527 sequences with these properties, 
two of which are designing. The densities of states for 
the two designing sequences and an example of a non- 
designing sequence are plotted in Fig. ^3 We have al- 
ready divided out a global symmetry factor 6 (number 
of possible directions for the link connecting the first 
two monomers) that all conformations on a s.c. lattice 
have in common. Since the ground-state conformations 
of the designing sequences spread into all three dimen- 
sions, an additional symmetry factor 4 x 2 = 8 (4 for 
rotations around the first bond, 2 for a remaining in- 
dependent reflection) makes a number of conformations 
obsolete and the ground-state degeneracy of the design- 
ing sequences is indeed unity. Obviously this is not the 
case for the sequences we identified as non-designing. In 
fact, the uniqueness of the ground states of designing se- 
quences is a remarkable property as there are not less 
than ~ 10 10 possible conformations of HP lattice pro- 
teins with 18 monomers. As we also see in Fig. ^3 the 
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FIG. 11: (a) Mean energy {E)t and (b) specific heat Cv 
for the two designing sequences with N = 18, run = 8, and 
-Emm = —9 (solid lines) in the HP model, whose densities 
of states were plotted in Fig. 1101 The curves of the same 
quantities for the 525 non-designing sequences are completely 
included within the respective areas between the dashed lines. 
The low-temperature peak of the specific heat (near T = 0.14) 
is most pronounced for the two designing sequences which 
behave similarly for low temperatures. 



ratio of the density of the first excited state (E = —8) for 
the designing and the non-designing sequences is smaller 
than for the ground state. This means that, at least 
for these short chains, the low-temperature behavior of 
the HP proteins in this model strongly depends on the 
degeneracy of the ground state. Furthermore, we ex- 
pect that the low-temperature behavior of both design- 
ing sequences is very similar as their low-energy densi- 
ties hardly differ. We have investigated this, once more 
for the 18mers with the properties described above, by 
considering the mean energy (E)t and the specific heat 
Cy(T). The results are shown in Figs. Illf a) and (b), 
respectively. The two solid curves belong to the two 
designing sequences and the dashed lines are the mini- 
mum/maximum bounds of the respective quantities for 
the non-designing sequences. As a main result we find 
that designing and non-designing sequences behave in- 
deed differently for very low temperatures. There is a 
characteristic, pronounced low-temperature peak in the 
specific heat that can be interpreted as kind of transi- 
tion between low-energy states with hydrophobic core 
and very compact globules. This confirms a similar ob- 
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FIG. 12: Minimum and maximum boundaries for the densi- 
ties of states of the 13 designing (filled boxes, connected by 
solid lines) and the 40 non-designing sequences (open circles, 
connected by dashed lines) with 18 monomers, hydrophobic- 
ity run ~ 3, and ground-state energy E m i n ~ —5.478 in the 
MHP model. Once more, a global symmetry factor 6 has 
already been divided out. 

servation for the 14mers studied in Ref. |49|. 

The upper bound of the specific heats for non- 
designing sequences in Fig. ITlT b) exposes two peaks. By 
analyzing our data for all 525 non-designing sequences we 
found that there are two groups: some of them experi- 
ence two conformational transitions, while others do not 
show a characteristic low-temperature behavior. Thus, 
the only appearance of these two peaks is not a unique, 
characteristic property of designing sequences. In order 
to quantify this observation, we have studied all relevant 
32896 sequences with 16 monomers. Only one of these 
sequences is designing (HP 2 HP 2 HPHPH 2 PHPH , with 
minimum energy E m i n = —9), but in total there are 
593 sequences, i.e., 1.8% of the relevant sequences, corre- 
sponding to curves of specific heats with two local max- 
ima. It should be noted that the degeneracies of the 
ground states associated with these sequences are com- 
parably small. 



B. Properties of the MHP Model 

In the MHP model, the energy levels are no longer 
equally spaced due to the additional non-integer HP in- 
teraction. Moreover, the absolute value of the energy 
of the lattice heteropolymer is not necessarily identical 
with the number of hydrophobic contacts. The forma- 
tion of a highly compact core is still desirable, but also 
the attractive contacts between H and P monomers re- 
duce the energy of the heteropolymer. For this reason, 
the relatively manifest distinction between "phases" with 
compact H-core states and entirely compact conforma- 
tions is expected to be much less pronounced, even for 
the designing sequences. 

Once more, we have first enumerated the densities of 
states for a set of sequences that have similar proper- 
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FIG. 13: Minimum and maximum boundaries of the (a) mean 
energy {E)t and (b) specific heat CV for the designing (solid 
lines) and non-designing sequences (dashed lines) in the MHP 
model, with the properties listed in the caption of Fig. 1121 



ties but differ only in the ordering of the sequence. For 
this study we chose all 18mers sharing the same hy- 
drophobicity m# = 3 and identical ground-state energy 
E min = 2w^g p + 8u^ p w -5.478, since there are 2 HH 
and 8 HP contacts in each of the ground-state confor- 
mations. We found 13 designing and 40 non-designing 
sequences that satisfy these specifications. In order not 
to be confused by too many curves, in Fig. ll2l again only 
the minimum and maximum boundaries of the design- 
ing sequences (solid lines) are shown as well as the corre- 
sponding bounds for the non-designing sequences (dashed 
lines). We observe that the regions enclosed by the 
boundaries do not exhibit significant differences in both 
cases except for the ground-state energy £7 m i n , where 
the ground-state degeneracy for the designing sequences 
is 5o = So min = Soma* = 48 ( Le -> identical with the 
global symmetry factor for 3D conformations), while the 
degeneracies for the non-designing sequences lie within 
the interval g$ min = 96 < g% < g$ m ^ = 3888. Note 
that < in (£ « -3.913) = and g< a (E « -4.913) = 
9min( E ~ -4.913) = 0. Interestingly, the state with 
energy E sa —3.913 is never occupied by the designing 
sequences. 

Since the densities of states for designing and non- 
designing sequences hardly differ, it is difficult to identify 
a particular thermodynamic behavior being characteris- 
tic for designing sequences only. This is indeed true as 



can be seen from Figs. I13f a) and (b), where we have 
plotted the lower and upper boundaries for the respec- 
tive mean energies and specific heats of these 18mers. In 
contrast to the results for the HP model (cf. Figs. Illf al 
and (b)), where, within a certain low-temperature inter- 
val, the regions enclosing the curves for the designing and 
non-designing sequences do not overlap, a separation of 
this kind is not apparent in the MHP model. Never- 
theless, these figures also show that for very low tem- 
peratures (0 < T < 0.1), the general behavior is very 
similar for all designing sequences, but it is not for the 
non-designing sequences, where the temperature depen- 
dence of energy and thus specific heat can be significantly 
different. 



VI. SUMMARY 

We have exactly analyzed the combined space of se- 
quences and conformations for proteins on the simple cu- 
bic lattice for two HP-type models that differ in the con- 
tact energy between hydrophobic and polar monomers. 
In the original HP model 0] this interaction is zero, while 
in the more realistic MHP model there is a nonzero 
contribution as suggested by the Miyazawa- Jernigan ma- 
trix of contact energies between amino acids in pro- 
teins 34]. Since there were only a few known exact 
results for heteropolymers in 3D, in particular on com- 
pact cuboid lattices, we generated by exact enumeration 
the sets of designing sequences and native conformations 
on non-compact simple cubic lattices. We studied, how 
their properties, measured, e.g., in terms of quantities 
like end-to-end distance, radius of gyration, designabil- 
ity, etc., differ from the bulk of all possible sequences and 
all self-avoiding conformations, respectively. We found 
that ground states of designing sequences, i.e. native con- 
formations, have a much smaller mean end-to-end dis- 
tance than the set of all conformations with the same 
length. Moreover, we confirmed that these conforma- 
tions are very compact, i.e., they have a smaller mean 
radius of gyration than the whole set. This is valid for 
both models under consideration. 

We have also studied energetic thermodynamic prop- 
erties, in order to investigate how characteristic the low- 
temperature behavior of designing compared to non- 
designing sequences is. We determined the densities of 
states for respective sets of selected 18mers with compa- 
rable properties. In the HP model, where the number 
of designing sequences is rather small compared with the 
MHP model, we could observe that energetic fluctuations 
are different for designing and non-designing sequences 
within a certain low-temperature region. Designing se- 
quences show up a more pronounced low-temperature 
peak in the specific heat being related to a conformational 
transition between low-energy states with hydrophobic 
core and highly compact globules. For the MHP model 
the situation is more diffuse, and a clear distinction be- 
tween designing and non-designing sequences based on 
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characteristic thermodynamic properties is not uniquely 
possible. Nevertheless, we have also seen in this model 
that designing sequences behave similarly for very low 
temperature while non-designing sequences react quite 
differently on changes of the temperature, over the entire 
range of temperatures. 
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